Abstract

A vast variety of biological, social, and economic networks shows topologies
drastically differing from random graphs; yet the quantitative characterization
remains unsatisfactory from a conceptual point of view. Motivated by the discussion of
small scale-free networks, a biased link distribution entropy is defined, which takes 
an extremum for a power law distribution. This approach is extended to the node- 
node link cross-distribution, whose nondiagonal elements characterize the graph 
structure beyond link distribution, cluster coefficient and average path length. From 
here a simple (and computationally cheap) complexity measure can be defined. This 
Offdiagonal Complexity (OdC) is proposed as a novel measure to characterize the 
complexity of an undirected graph, or network. While both for regular lattices and 
fully connected networks OdC is zero, it takes a moderately low value for a random 
graph and shows high values for apparently complex structures such as scale-free
networks and hierarchical trees. The Offdiagonal Complexity approach is applied to the
Helicobacter pylori protein interaction network and randomly rewired surrogates. 



1 Introduction 



While random graph theory and scale-free network research know a set of 
standard measures to quantify their properties, the question of complexity 
of a graph is still in its infancy. A 'blind' application of other complexity
measures (as for binary sequences or computer programs) does not account
for the special properties shared by graphs and especially scale-free graphs. 
Moreover, some known complexity measures themselves have a high compu- 
tational complexity. 

Following a series of seminal papers (Watts & Strogatz [1], Barabási & Albert
[2,3], Newman [4], Dorogovtsev & Mendes [5]) since 1999 (see also [6] for an
overview), small-world and scale-free networks have become a hot topic of
investigation in a broad range of systems and disciplines. Metabolic and other
biological networks, collaboration networks, the WWW, the Internet, etc., have
in common that the distribution of link degrees follows a power law and thus
has no inherent scale. Such networks are termed 'scale-free networks'. In
contrast to random graphs, which have a Poisson link distribution and thus a
characteristic scale, they share a number of distinctive properties, especially
a high clustering coefficient and a short average path length.

Mathematically, a graph (or synonymously in this context, a network) is
defined by a (nonempty) set of nodes, a set of edges (or links), and a map that
assigns two nodes (the "end nodes" of a link) to each link. In a computer, a
graph may be represented either by a list of links, each given by a pair of
nodes, or equivalently by its adjacency matrix $a_{ij}$, whose entries are 1 (0)
if nodes $i$, $j$ are connected (disconnected). Useful generalizations are
weighted graphs, where the restriction of $a_{ij}$ is relaxed from binary values
to (usually nonnegative) integer or real values (e.g. resistor values, travel
distances, interaction coupling), and directed graphs, where $a_{ij}$ no longer
needs to be symmetric, and the link from $i$ to $j$ and the link from $j$ to $i$
can exist independently (e.g. links between webpages, or scientific citations).

Here the discussion is limited to binary undirected graphs, such as an
acquaintance network or a railway network as shown below. In the following
sections the link (degree) distribution and the next-order cross-distribution
are investigated and taken as a basis for a complexity measure.
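
The two computer representations mentioned above can be illustrated with a minimal Python sketch. The function names `adjacency_matrix` and `link_list` and the small example graph are assumptions chosen for illustration; they are not taken from the original work.

```python
def adjacency_matrix(n, links):
    """Build the symmetric 0/1 adjacency matrix of an undirected graph
    with n nodes (labelled 0..n-1) from a list of links (node pairs)."""
    a = [[0] * n for _ in range(n)]
    for i, j in links:
        a[i][j] = 1
        a[j][i] = 1  # undirected graph: the matrix is symmetric
    return a


def link_list(a):
    """Recover the list of links (i < j) from a symmetric adjacency matrix."""
    n = len(a)
    return [(i, j) for i in range(n) for j in range(i + 1, n) if a[i][j]]


# Hypothetical example: a triangle 0-1-2 with a pendant node 3 attached to 2.
links = [(0, 1), (0, 2), (1, 2), (2, 3)]
a = adjacency_matrix(4, links)
degrees = [sum(row) for row in a]  # node degree = row sum of the matrix
```

Both representations carry the same information; the node degrees used in the following sections are simply the row sums of the adjacency matrix.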



2 Other complexity measures 



For text strings (such as computer programs or DNA) there are common
complexity measures in theoretical computer science, such as Kolmogorov
complexity (and the related Lempel-Ziv complexity and algorithmic information
content, AIC) [8]. For example, AIC is defined as the length of the shortest
program generating the string. For random structures, and thus also for random
graphs, these measures indicate high complexity. Distinguishing complex
structured (but still partly random) structures from completely random ones is
usually not possible with this class of measures. For this reason, measures of
effective complexity [9] have been discussed; usually these are defined as an
entropy (or description length) of "a concise description of a set of the
entity's regularities" [9]. Here we are mainly interested in this second class,
and straightforwardly one would try to apply existing measures, e.g., to the
link list or to the adjacency matrix. However, mathematically it is not
straightforward to apply these text-string-based measures to graphs, as there
is no unique way to map a graph onto a text string. For the case of
hierarchical structures, which can be represented by trees, Ceccatto and
Huberman quantified complexity from the diversity of the subtrees [7]. As
natural networks typically exhibit a non-negligible occurrence of triangles and
higher-order loops, other approaches have to be chosen for networks in general.

Thus one desires complexity measures that are defined directly for graphs.
Two classical measures known from graph theory, graph thickness and coloring
number, have a low "resolution" (typically integer values up to 4), and their
relevance for real networks is not clear. Two new complexity measures have
recently been proposed for graphs: Medium Articulation [10] for weighted graphs
(as they appear in foodwebs), and a measure for directed graphs by
Meyer-Ortmanns [11] based on the network motif concept [12]. Unfortunately,
these two complexity measures are computationally quite costly. A computational
complexity approach has been defined by Machta and Machta [13] as the
computational depth of an ensemble of graphs (e.g. small-world, scale-free,
lattice). It is defined as the number of processing time steps a large parallel
computer (with an unlimited number of processors) would need to generate a
representative member of that graph ensemble. Unlike other approaches, it does
not assign a single complexity value to each graph, and again it is nontrivial
to compute.

Following [9], an especially desired property of a complexity measure is the
ability to distinguish nonrandom complex structures from both pure randomness
and regular structures such as lattices. In this respect, the effective
complexity and the Machta approach fulfill these prerequisites perfectly, but
to date no simple method is available to compute them. Hence, a simpler
estimator of graph complexity is desired, and one possible approach, the
Offdiagonal Complexity, is proposed here. It is motivated by a striking
observation on the node-node link correlation matrices of complex networks
[14], namely that entries are more evenly spread among the offdiagonals,
compared to both regular lattices and random graphs (see Figs. 4 and 5 for a
comparison). This can be used to define a complexity measure for undirected
graphs [14,15].



This article is organized as follows. In Sec. 3 the approach is motivated from 
link entropies and node-node correlations. In Sec. 4 OdC is defined. Section 5 
investigates the application of OdC to a protein interaction network, compared 
with randomized surrogates. 
3 Motivation of OdC



3.1 Node degree correlations: Methods of classical statistics

A straightforward mathematical approach to study node-node link correla- 
tions, i.e. correlations between degrees of pairs of nodes, is to use rank corre- 
lation methods [16] from classical statistics to analyze the link distributions. 

Two common rank correlation methods can be described as follows. One
considers a list of rank numbers of link numbers (node degrees). For each of the
two graphs ($A$ and $B$) to be compared, there is an (ordered) list of link
numbers, e.g. $(k_1, k_2, \ldots, k_N) = (5, 2, 2, 1, 1, 1)$, and one assigns a
rank number to each entry, $(r_1, r_2, \ldots, r_N) = (1, 2.5, 2.5, 5, 5, 5)$.
Here the identical second and third ranks are replaced by their (noninteger)
average value, and likewise the fourth to sixth ranks; as node degrees are
highly degenerate, this will occur frequently.

Then the Kendall tau coefficient is defined as

$$\tau = \frac{1}{n(n-1)} \sum_{i \neq j} \sigma_{ij},$$

where $\sigma_{ij} = \pm 1$ if a pair of elements is ranked equally (resp.
unequally) in both lists, $\sigma_{ij} = \mathrm{sgn}(r_i^A - r_j^A) \cdot
\mathrm{sgn}(r_i^B - r_j^B)$. Its apparent drawbacks within this context are
the costly computation, $O(n^2)$, and that it seems not easy to handle
analytically, as one must have the nodes sorted by their degree for each member
of (e.g.) an ensemble average.

The second main rank correlation method is Spearman's rho, defined by

$$r_s = 1 - \frac{6 \sum_i d_i^2}{n^3 - n}, \qquad d_i = r_i^A - r_i^B.$$

Some of its properties are:

$r_s = +1$ for identical rank lists, e.g. (1 2 3) and (1 2 3);

$r_s = -1$ for counter-sequenced rank lists, e.g. (1 2 3) and (3 2 1);

$r_s = +\frac{1}{2}$ if one sequence is constant, since then $\sum_i d_i^2 =
n(n^2-1)/12$. (One might wonder why $r_s = 0$ does not hold here. However, for
$n = 3$, $r_s \neq 0$ always holds; but the average over all possible rank
lists vanishes, $\langle r_s \rangle = 0$.)

In general, rank correlation methods are not appropriate for a high degeneracy, 
i.e. a large number of nodes with the same number of links. 
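
The rank assignment with averaged ties and the properties of Spearman's rho listed above can be checked with a short Python sketch. The helper names `average_ranks` and `spearman_rho` are hypothetical, and the code applies the simple $d_i^2$ formula from the text directly to tied ranks, without any tie correction.

```python
def average_ranks(values):
    """Rank values in descending order (largest value gets rank 1);
    tied values all receive the average of the ranks they occupy."""
    order = sorted(range(len(values)), key=lambda i: -values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # extend the block [pos, end] over all entries tied with order[pos]
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1  # mean of the 1-based ranks pos+1 .. end+1
        for idx in order[pos:end + 1]:
            ranks[idx] = avg
        pos = end + 1
    return ranks


def spearman_rho(a, b):
    """Spearman's rho via r_s = 1 - 6 * sum(d_i^2) / (n^3 - n)."""
    ra, rb = average_ranks(a), average_ranks(b)
    n = len(a)
    d2 = sum((x - y) ** 2 for x, y in zip(ra, rb))
    return 1 - 6 * d2 / (n ** 3 - n)


ranks = average_ranks([5, 2, 2, 1, 1, 1])   # degree list from the text
rho_constant = spearman_rho([3, 2, 1], [7, 7, 7])  # one constant sequence
```

For the degree list of the example, half of the entries already share a rank, which illustrates why a highly degenerate degree sequence limits the usefulness of rank correlations.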

Thus, it is desired to formulate other measures that can estimate the complex- 
ity of a graph from correlation information of pairs of nodes. The approach 
of this paper is to define an entropy-type measure. To motivate the ansatz, 
the problem of binning and the definition of a link entropy is discussed in the 
next section. 
3.2 Fit of sparse power-law distributions



The fit of sparse distributions by binning has to cope with the problem of 
zeroes and with the effect of arbitrariness of the choice of interval length and 
position. As an example we consider the link distribution of a traffic network 
[17] (see Fig. 1). 

As intervals have to be chosen so that no zeroes occur ($-\infty$ in log scale),
one has the choice between different 'tricks' (influencing the fit): (i) irregular 
intervals: choice influences fit, or (ii) regular intervals n max • \[2 ln(c • exp(fc)), 
however they imply a severe reduction of the number of intervals. Even the 
two remaining parameters influence the result (especially for large link
numbers; see Fig. 2b). A moderately 'clean' method is to place the entry with
the largest link number in the middle of its interval. A parameter-free approach
is the integrated density. For a power law density with exponent $\alpha > 1$,
one has



$$\int_x^\infty dk\, k^{-\alpha} = \frac{x^{1-\alpha}}{\alpha - 1}.$$



Instead of the density itself, the integrated density can be fitted (see Fig. 2c). 
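
A fit of the integrated density can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `integrated_density` and `fit_exponent` are hypothetical, the fit is a plain least-squares line in log-log space, all degrees are assumed positive, and no discretization correction is applied.

```python
import math
from collections import Counter


def integrated_density(degrees):
    """Return (ks, P), where P[i] is the fraction of nodes with degree >= ks[i].
    No binning is needed, so empty intervals (zeroes) cannot occur."""
    counts = Counter(degrees)
    n = len(degrees)
    ks = sorted(counts)
    P = []
    remaining = n
    for k in ks:
        P.append(remaining / n)
        remaining -= counts[k]
    return ks, P


def fit_exponent(degrees):
    """Least-squares slope of log P(>=k) versus log k.
    For a power law density k^(-alpha) the slope is 1 - alpha."""
    ks, P = integrated_density(degrees)
    xs = [math.log(k) for k in ks]
    ys = [math.log(p) for p in P]
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return 1.0 - slope


ks, P = integrated_density([5, 2, 2, 1, 1, 1])
```

Because every occupied degree contributes a point, the only freedom left is the fitting procedure itself, not the choice of intervals.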



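For exact results, a discretization correction is necessary, as the measured
integrated density is a sum over integer link numbers $k \ge n$ rather than the
integral $\int_n^\infty dk\, k^{-\alpha}$.

Alternatively, from $\sum_{k \ge n} k^{-1} p(k)$ one gets a plot with the same
slope as $p(k)$ itself.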
3.3 Entropy of the link distribution



As demonstrated in Sec. 3.2, the estimation of the scale exponent from a
measured distribution by binning has inherent degrees of freedom; this can be
overcome by a fit of the integrated density. Estimating the entropy of a
distribution (density) with sampling gaps, however, leads to underestimation
(Grassberger [18]). A straightforwardly defined link distribution entropy
$H = -\sum_k p_k \ln p_k$ becomes extremal for the equidistribution (and not
for a power law). Power law candidate distributions are usually
logarithmically binned. However, for a power law one obtains a distribution
with linear decay (in the binned log-log space, as in Fig. 2b, c), and not an
equidistribution, so again $H$ is not maximal.

This problem is solved by defining a "Biased Link Entropy", which shows an
extremum with respect to the exponent $\alpha$ (see Fig. 3): for the proper
choice of $\alpha$, the transformed density $k^\alpha p_k$ is the
equidistribution. With the necessary normalization
$N(\alpha) = \sum_k k^\alpha p_k / \delta_k$, where $\delta_k$ may be a binning
interval width, the biased link entropy reads

$$H(\alpha) = -\sum_k \frac{k^\alpha p_k}{\delta_k N(\alpha)}
\ln \frac{k^\alpha p_k}{\delta_k N(\alpha)}.$$
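
The extremum property of the biased link entropy can be illustrated numerically. The sketch below is an assumption-laden example, not the original implementation: it uses unit interval widths ($\delta_k = 1$), skips empty entries, and locates the extremum by a simple grid search; the names `biased_link_entropy` and `estimate_exponent` are hypothetical.

```python
import math


def biased_link_entropy(p, alpha):
    """Biased link entropy H(alpha) for a degree distribution p (dict k -> p_k),
    assuming unit interval widths (delta_k = 1) and degrees k > 0."""
    weights = [(k ** alpha) * pk for k, pk in p.items() if pk > 0]
    norm = sum(weights)  # N(alpha)
    return sum(-(w / norm) * math.log(w / norm) for w in weights)


def estimate_exponent(p, alphas):
    """Return the candidate exponent for which H(alpha) is maximal."""
    return max(alphas, key=lambda a: biased_link_entropy(p, a))


# Synthetic, gap-free power law p_k ~ k^(-2) on k = 1..50 (normalization is
# irrelevant here because N(alpha) rescales the weights).
p = {k: k ** -2.0 for k in range(1, 51)}
alphas = [i / 10 for i in range(41)]  # grid 0.0, 0.1, ..., 4.0
alpha_hat = estimate_exponent(p, alphas)
```

For the synthetic distribution the transformed density becomes flat exactly at the true exponent, so the entropy reaches its maximum $\ln 50$ there. For measured distributions with sampling gaps the extremum is less sharp.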
3.4 Node degree correlations: Entropy approach? 

The idea now is to use entropies instead of correlations or rank correlations.
Naively one would define an entropy over all coefficients of the node degree
correlation matrix $p_{kl}$, $H = -\sum_{kl} p_{kl} \ln p_{kl}$. However, then
any invariances like $(k_1, k_2) \to (2k_1, 2k_2)$ or
$(k_1, k_2) \to (k + k_1, k + k_2)$ are lost, but such invariances would be
desired for different description levels of the systems. Another possible
approach could be via the Kullback-Leibler distance
$D(p^A, p^B) = \sum_i p_i^A \ln(p_i^A / p_i^B)$. Here, one has to apply it to
the node degrees $k_i^A$, $k_i^B$ for each link $i$. However, this is
generically nonsymmetric (for a symmetrized definition see [19]), and again,
there is no invariance under e.g. $(k_1, k_2) \to (2k_1, 2k_2)$. As a last
approach, one could define a Biased Cross Link Entropy by replacing $k^\alpha$
by $k_1^\alpha \cdot k_2^\alpha$. This discussion shows that simple definitions
via link entropies bear difficulties.
Fig. 1. Example of a small network: The Intercity railway (plus flyway) network in
Germany approximately shows a scale-free link distribution (see Figs. 2 and 3).



[Figure 2: three panels (a)-(c), each showing P(k).]

Fig. 2. (a) Problem of zeroes (see text). (b) Result of different binnings depending
on the parameters $c$ and $n_{\max}$. (c) The integrated density is defined free of parameters.



[Figure 3: panel (a) shows P(k) versus k; panel (b) shows H.]

Fig. 3. The biased entropy of the distribution (a) shows an extremum with respect
to the exponent $\alpha$ (b). From here, we have a parameter-free estimation of $\alpha$.
Fig. 4. Small non-complex networks: These networks are large, not complex, and
not scale-free. A single entry or a single diagonal with nonzero entries indicates low
complexity. Shown are a regular lattice in 1D and 2D (top) and a Bethe lattice
and a star graph (bottom). The third example (middle) is the box-plane-stick-loop
concatenation of different-dimensional finite lattices, widely used as a data analysis
test set.
Fig. 5. Small complex networks: A striking observation is that entries are quite
evenly spread on the offdiagonals. Can this be used to define a complexity measure? 
4 Definition of the Offdiagonal Complexity (OdC)

Let $g_{ij}$ be the adjacency matrix of a graph with $N$ nodes, i.e., $g_{ij} = 1$ if nodes
$i$ and $j$ are connected, else $g_{ij} = 0$. Then OdC is defined as follows [15].

(i) For each node $i$, let $l(i)$ be the node degree, i.e. the number of edges (links),

$$l(i) := \sum_{j=0}^{N-1} g_{ij}. \qquad (2)$$

(ii) Let $c_{mn}$ be the number of edges between all pairs of nodes $i$ and
$j$ with node degrees $m = l(i)$, $n = l(j)$, with $l(j) \ge l(i)$ (ordered
pairs), i.e.,

$$c_{mn} := \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} g_{ij}\, \delta_{m,l(i)}\,
\delta_{n,l(j)}\, H(l(j) - l(i)). \qquad (3)$$

Here $\delta$ is the Kronecker symbol, and $H(x) = 1$ for $x \ge 0$ and
$H(x) = 0$ for $x < 0$. Due to the pair ordering, the matrix $c_{mn}$ has
entries only on the main diagonal and above. Thus, $c_{mn}$ is a (not
normalized) node-node link correlation matrix.

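(iii) Summation over the minor diagonals, or offdiagonals, i.e. all pairs with
the same difference of the degrees $k_i$, $k_j$, up to
$k_{\max} = \max_i \{l(i)\}$, and normalization,

$$\tilde a_k := \sum_{i=0}^{k_{\max}-k} c_{i,i+k}, \qquad
a_k := \tilde a_k \Big/ \sum_{k'=0}^{k_{\max}} \tilde a_{k'}. \qquad (4)$$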

(iv) Then OdC is defined as an entropy measure on this normalized
distribution (here it is understood that $0 \ln(0) = 0$),

$$\mathrm{OdC} = -\sum_{k=0}^{k_{\max}} a_k \ln a_k. \qquad (5)$$



OdC is an approximative complexity estimator that takes the value zero for a
regular lattice, zero for a fully connected graph, low values for a random
graph, and higher values for 'apparently complex' structures. (An orthogonal
$n$-dimensional lattice with periodic boundaries consists of bulk nodes with
$2n$ neighbors. Thus $c_{2n,2n} = 1$ is the only entry; for this regular
structure OdC vanishes. Also a 2-dimensional hexagonal lattice has only one
entry.) One main advantage is that it does not involve costly (high-order or
NP-complete) computations.
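
Steps (i) to (iv) can be sketched compactly in Python. This is a minimal illustration under stated assumptions, not the author's code: the function name `offdiagonal_complexity` is hypothetical, the graph is given as a list of links, and each undirected edge is counted once (a literal double sum over ordered pairs would weight edges between equal-degree nodes differently). Since only the sums over the offdiagonals enter OdC, the matrix $c_{mn}$ is not stored; edges are grouped directly by the degree difference of their end nodes.

```python
import math
from collections import Counter


def offdiagonal_complexity(edges):
    """OdC of an undirected simple graph given as an iterable of (i, j) links."""
    # normalize: drop self-loops and duplicate links, order each pair
    edges = {tuple(sorted(e)) for e in edges if e[0] != e[1]}

    # (i) node degrees l(i)
    degree = Counter()
    for i, j in edges:
        degree[i] += 1
        degree[j] += 1

    # (ii) + (iii) offdiagonal sums: number of edges whose end-node degrees
    # differ by k (assumption: every undirected edge is counted once)
    a_tilde = Counter(abs(degree[i] - degree[j]) for i, j in edges)
    total = sum(a_tilde.values())
    if total == 0:
        return 0.0

    # (iv) entropy of the normalized distribution; empty offdiagonals are
    # simply absent, which implements 0 ln 0 = 0
    return sum(-(c / total) * math.log(c / total) for c in a_tilde.values())


ring = [(i, (i + 1) % 6) for i in range(6)]            # 1D periodic lattice
complete = [(i, j) for i in range(5) for j in range(i + 1, 5)]
path = [(0, 1), (1, 2), (2, 3)]                        # small non-regular example
values = [offdiagonal_complexity(g) for g in (ring, complete, path)]
```

The ring and the complete graph give zero, in line with the properties stated above, while already the short path has a nonzero value because its edges fall on two different offdiagonals. The cost is linear in the number of links.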
5 Application to the Helicobacter pylori protein interaction graph
and reshuffling to a random graph 

To demonstrate that OdC can distinguish between random graphs and complex
networks, the Helicobacter pylori protein interaction graph [20] has been
chosen. For different rewiring probabilities $p$ and $10^2$ realizations each,
the links have been reshuffled, ending up with a random graph for $p = 1$. As
can be seen in Fig. 6, rewiring in any case lowers the Offdiagonal Complexity.
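
The text does not specify the rewiring scheme in detail. As one possible reading, the following Python sketch replaces each link, with probability $p$, by a link between two randomly chosen distinct nodes, so that the number of links is conserved and $p = 1$ yields a random graph. The function name `rewire`, its signature, and the small example are assumptions for illustration only.

```python
import random


def rewire(edges, nodes, p, rng):
    """Return a rewired copy of an undirected simple graph.

    edges: iterable of (i, j) links; nodes: list of all node labels;
    p: rewiring probability per link; rng: a random.Random instance.
    Assumed scheme: a selected link is removed and replaced by a link
    between two random distinct nodes that are not yet connected."""
    original = {tuple(sorted(e)) for e in edges}
    result = set(original)
    for e in sorted(original):
        if rng.random() < p:
            result.discard(e)
            while True:
                i, j = rng.sample(nodes, 2)
                new = (min(i, j), max(i, j))
                if new not in result:
                    result.add(new)
                    break
    return result


nodes = list(range(6))
square = [(0, 1), (1, 2), (2, 3), (3, 0)]
surrogate = rewire(square, nodes, 0.5, random.Random(1))
```

An ensemble of such surrogates for a range of $p$ values, each evaluated with a complexity measure, would give curves of the kind shown in Fig. 6.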




Fig. 6. OdC for random reshufflings of the Helicobacter pylori network (left, p = 0)
up to a rewiring probability of p = 1 (right, random graph). The bold line shows the
average; five OdC trajectories along a rewiring path are shown for illustration (thin lines).

6 Conclusions and Outlook 

A new complexity measure for graphs and networks has been proposed. The
motivation of its definition is twofold. The first observation is that the
binning of link distributions is problematic for small networks. The second
observation is that if, instead of the (plain) entropy of the link
distribution, which is not significant for scale-free networks, one uses a
"biased link entropy", it has an extremum where the exponent of the power law
is met.

The central idea of OdC is to apply an entropy measure to the degree
correlation matrix, after summation over the offdiagonals. This allows for a
quantitative, yet still approximative, measure of complexity. OdC roughly is
'hierarchy sensitive' and has the main advantage of being computationally
inexpensive.

Acknowledgments. J.C.C. thanks Christian Starzynski for providing the 
simulation code for Fig. 6, and an anonymous referee for constructive remarks. 